R for Bio Data Science

TCGA Pan-Cancer Analysis with Emphasis on Survival Analysis

Introduction to TCGA Pan-Cancer Dataset

  • The Cancer Genome Atlas (TCGA)
  • A coordinated effort to characterize molecular events in primary cancers
  • Collection of clinicopathologic annotation data along with multi-platform molecular profiles of more than 11,000 human tumors across 33 different cancer types
  • Our analysis uses the 6 cancer types with the most patients
  • Significance of the Pan-Cancer dataset in cancer genomics research
    • Allows simultaneous analysis of genomic data across a diverse range of cancer types
    • Crucial for discovering shared molecular features, potential biomarkers and therapeutic targets
    • Contributes to more effective and personalized cancer treatments
  • How this dataset helps in understanding cancer at a molecular level
    • Allows identification of patterns and shared characteristics among different cancer types

Project Aim and Focus

  • Perform an in-depth TCGA Pan-Cancer analysis, focusing on survival analysis
  • Explore and understand factors influencing survival outcomes within the TCGA dataset
  • Fit Cox proportional hazards (CoxPH) models

Data Acquisition and Cleaning

  • Data retrieved and loaded using read_excel()
  • Raw data is an 11,160 x 34 tibble:
Column                               Row 1                                 Row 2                                 Row 3
...1                                 1                                     2                                     3
bcr_patient_barcode                  TCGA-OR-A5J1                          TCGA-OR-A5J2                          TCGA-OR-A5J3
type                                 ACC                                   ACC                                   ACC
age_at_initial_pathologic_diagnosis  58                                    44                                    23
gender                               MALE                                  FEMALE                                FEMALE
race                                 WHITE                                 WHITE                                 WHITE
ajcc_pathologic_tumor_stage          Stage II                              Stage IV                              Stage III
clinical_stage                       [Not Applicable]                      [Not Applicable]                      [Not Applicable]
histological_type                    Adrenocortical carcinoma- Usual Type  Adrenocortical carcinoma- Usual Type  Adrenocortical carcinoma- Usual Type
histological_grade                   [Not Available]                       [Not Available]                       [Not Available]
initial_pathologic_dx_year           2000                                  2004                                  2008
menopause_status                     [Not Available]                       [Not Available]                       [Not Available]
birth_days_to                        -21496                                -16090                                -8624
vital_status                         Dead                                  Dead                                  Alive
tumor_status                         WITH TUMOR                            WITH TUMOR                            WITH TUMOR
last_contact_days_to                 NA                                    NA                                    2091
death_days_to                        1355                                  1677                                  NA
cause_of_death                       [Not Available]                       [Not Available]                       [Not Available]
new_tumor_event_type                 Distant Metastasis                    Distant Metastasis                    Distant Metastasis
new_tumor_event_site                 Peritoneal Surfaces                   Soft Tissue                           Lung
new_tumor_event_site_other           NA                                    NA                                    NA
new_tumor_event_dx_days_to           754                                   289                                   53
treatment_outcome_first_course       Complete Remission/Response           Progressive Disease                   Complete Remission/Response
margin_status                        NA                                    NA                                    NA
residual_tumor                       NA                                    NA                                    NA
OS                                   1                                     1                                     0
OS.time                              1355                                  1677                                  2091
DSS                                  1                                     1                                     0
DSS.time                             1355                                  1677                                  2091
DFI                                  1                                     NA                                    1
DFI.time                             754                                   NA                                    53
PFI                                  1                                     1                                     1
PFI.time                             754                                   289                                   53
Redaction                            NA                                    NA                                    NA
  • Initial data exploration:
    • E.g. using slice_head(), glimpse() and names()
    • Identify NAs
  • Cleaning:
    • Remove index
    • Convert missing information not coded as NA (e.g. “[Not Available]”) to factor levels
    • Drop columns with >80% missing values
    • Impute remaining NAs: numeric values with the median, categorical values with the most frequently occurring value
    • Rename columns for clarity
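
The cleaning steps above could be sketched roughly as follows. This is a minimal illustration on a toy tibble, not the project's actual code: the column names, the toy values and the helper most_frequent() are assumptions made for the example.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the raw data; names and values are illustrative only
raw <- tibble(
  index          = 1:6,
  age            = c(58, NA, 23, 44, 61, 70),
  gender         = c("MALE", "FEMALE", NA, "FEMALE", "FEMALE", "MALE"),
  mostly_missing = c(NA, NA, NA, NA, NA, "x")   # 5 of 6 values missing (> 80%)
)

# Hypothetical helper: most frequent non-missing value of a character vector
most_frequent <- function(x) {
  names(which.max(table(x)))
}

cleaned <- raw |>
  select(-index) |>                                   # remove the index column
  select(where(\(col) mean(is.na(col)) <= 0.8)) |>    # drop columns with > 80% NAs
  mutate(
    # numeric NAs -> column median
    across(where(is.numeric), \(col) replace_na(col, median(col, na.rm = TRUE))),
    # categorical NAs -> most frequent value
    across(where(is.character), \(col) replace_na(col, most_frequent(col)))
  )

cleaned
```

Median and mode imputation keep every row but shrink the variance of the imputed columns, so they are best treated as a pragmatic choice rather than a neutral one.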

Data Transformation and Tools

  • Mutate two new columns, by calculating:
    • “age_at_last_followup_years”
    • “disease_duration_years”
  • Mutate a new column “age_group_diagnosis” where patients are divided into age categories
  • Define function to assign new variable “stage_group” based on:
    • “clinical_stage”
    • “pathological_stage”
  • Create subset of the data from the top 6 cancer types:
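# Identify the six cancer types with the most patients
top_cancer_types <- TCGA_clean |>
  count(cancer_type) |>
  arrange(desc(n)) |>
  slice(1:6) |>
  pull(cancer_type)

# Keep only patients with one of those cancer types
TCGA_aug <- TCGA_aug |>
  filter(cancer_type %in% top_cancer_types)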
  • … leaving us with a new tibble TCGA_aug of size 3,893 x 33, ready for description and analysis
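
As a rough illustration of the age and stage grouping described above, the sketch below bins age with cut() and derives a stage group from the two stage columns. The age cut-offs, the rule of preferring the pathological stage over the clinical stage, the column name age_at_diagnosis and the helper assign_stage_group() are all assumptions for the example; the slides do not specify them.

```r
library(dplyr)

# Toy data; column names follow the renamed columns mentioned above (assumed)
toy <- tibble(
  age_at_diagnosis   = c(23, 44, 58, 81),
  pathological_stage = c("Stage IIA", "Stage IV", "[Not Available]", "[Not Available]"),
  clinical_stage     = c("[Not Applicable]", "[Not Applicable]", "Stage IIIB", "[Not Applicable]")
)

# Hypothetical helper: use the pathological stage when present,
# otherwise fall back to the clinical stage
assign_stage_group <- function(pathological, clinical) {
  stage <- if_else(grepl("^Stage", pathological), pathological, clinical)
  # Order matters: "Stage IV" and "Stage III" also start with "Stage I"
  case_when(
    grepl("^Stage IV", stage)  ~ "Stage 4",
    grepl("^Stage III", stage) ~ "Stage 3",
    grepl("^Stage II", stage)  ~ "Stage 2",
    grepl("^Stage I", stage)   ~ "Stage 1",
    TRUE                       ~ "Unknown"
  )
}

toy_aug <- toy |>
  mutate(
    # Illustrative age bins; intervals are closed on the left, e.g. [40, 60)
    age_group_diagnosis = cut(
      age_at_diagnosis,
      breaks = c(0, 40, 60, 80, Inf),
      labels = c("<40", "40-59", "60-79", "80+"),
      right  = FALSE
    ),
    stage_group = assign_stage_group(pathological_stage, clinical_stage)
  )

toy_aug
```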

Exploratory Data Analysis Insights

Survival Analysis Outcomes

  • Main results from the CoxPH model
  • We fitted a CoxPH model to examine the impact of various predictors on survival
  • Predictors: cancer type, age at diagnosis, gender and race
  • We also stratified by cancer type using strata(), which gives each cancer type its own baseline hazard
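
A stratified fit of this kind might look like the sketch below, using coxph(), Surv() and strata() from the survival package. The data are simulated purely for illustration: the cancer type labels, effect sizes and sample size are assumptions, and only the column names OS and OS.time are taken from the raw data shown earlier.

```r
library(survival)

set.seed(1)
n <- 300

# Simulated patients; all values are illustrative, not TCGA data
sim <- data.frame(
  cancer_type      = sample(c("BRCA", "GBM", "LUAD"), n, replace = TRUE),
  age_at_diagnosis = round(runif(n, 30, 85)),
  gender           = sample(c("FEMALE", "MALE"), n, replace = TRUE)
)

# Simulated hazard: rises with age and is higher for one cancer type
rate <- 0.0005 *
  exp(0.03 * (sim$age_at_diagnosis - 60)) *
  ifelse(sim$cancer_type == "GBM", 5, 1)

event_time  <- rexp(n, rate)          # time to death
censor_time <- runif(n, 0, 3000)      # time to end of follow-up
sim$OS.time <- pmin(event_time, censor_time)
sim$OS      <- as.integer(event_time <= censor_time)   # 1 = death observed

# strata(cancer_type) gives each cancer type its own baseline hazard,
# so no coefficient is estimated for cancer type itself
fit_strat <- coxph(
  Surv(OS.time, OS) ~ age_at_diagnosis + gender + strata(cancer_type),
  data = sim
)

summary(fit_strat)
```

Because cancer type enters only through the strata, the summary reports hazard ratios for age and gender but none for the cancer types.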

Survival Plot

Interpreting the Model

  • Direct comparison of HR across cancer types without stratification
    • Understanding relative risks associated with each factor
  • GBM stands out - Risk of death nine times greater
  • Age - Risk of death increases slightly with each additional year
  • Gender - No significant difference based on gender
  • Cancer stage - Stage 4 risk of death six times greater
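
Hazard ratios like those above are the exponentiated coefficients of an unstratified model in which cancer type enters as an ordinary covariate. The sketch below shows one way to extract them with confidence intervals. The data are simulated, and the cancer type labels, effect sizes and the reference level (the alphabetically first type) are assumptions for illustration only.

```r
library(survival)

set.seed(2)
n <- 300

# Simulated patients; all values are illustrative, not TCGA data
sim <- data.frame(
  cancer_type      = sample(c("BRCA", "GBM", "LUAD"), n, replace = TRUE),
  age_at_diagnosis = round(runif(n, 30, 85))
)
rate <- 0.0005 *
  exp(0.03 * (sim$age_at_diagnosis - 60)) *
  ifelse(sim$cancer_type == "GBM", 5, 1)
event_time  <- rexp(n, rate)
censor_time <- runif(n, 0, 3000)
sim$OS.time <- pmin(event_time, censor_time)
sim$OS      <- as.integer(event_time <= censor_time)

# Unstratified model: cancer type is a covariate, so each non-reference
# type gets a hazard ratio relative to the reference level ("BRCA" here)
fit <- coxph(Surv(OS.time, OS) ~ cancer_type + age_at_diagnosis, data = sim)

# HR = exp(coefficient), with 95% confidence limits on the same scale
hr_table <- cbind(HR = exp(coef(fit)), exp(confint(fit)))
round(hr_table, 2)

# The age HR is per additional year; over ten years it compounds
hr_age_10y <- exp(10 * coef(fit)[["age_at_diagnosis"]])
hr_age_10y
```

A hazard ratio only has meaning relative to its reference level, so a statement such as "nine times greater" should be read as relative to whichever cancer type serves as the baseline in the fitted model.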

Proportional Hazard Ratio Plot

Implications and Limitations

  • Significance for Oncology
    • Cox model identifies survival-related factors
    • Guides prognosis and treatment strategies
    • Stresses need for early detection and robust treatments
  • Study Limitations
    • Model may omit key influencing factors (lifestyle, socioeconomics)
    • Potential unaccounted variable interactions

Conclusion and Future Outlook

  • Summary of Findings
    • Employed TCGA Pan-Cancer dataset for survival factor analysis
    • Identified age and cancer type as critical survival predictors
  • Contribution to Precision Medicine
    • Aligns with personalized care based on patient-specific data
    • Promises enhanced predictive accuracy with more genomic information
  • Future Research Trajectory
    • Aims for advanced models to deepen cancer survival understanding
    • Sets a foundation for improved patient treatment and survival rates

Survival Plot